On the island country of Kronos, increasingly noxious effects on health and farming have been linked to the uncontrolled activity of GAStech, a natural gas operator supported by corrupt government officials. On January 20th, 2014, a corporate meeting is held to celebrate the company's new-found fortune following its initial public offering. However, a series of unusual events leads to the disappearance of several employees. The Protectors of Kronos (POK), a social movement organization that has been fighting against water contamination and government corruption, is suspected in the disappearance.
As analysts, we were assigned several tasks to identify risks and determine how they could have been mitigated more effectively.
Using data and visual analytics, we evaluate the changing levels of risk to the public and recommend actions for first responders:
Distinguish meaningful event reports from typical chatter, junk, and spam.
Evaluate how the risk to the public evolves over the course of the evening, considering the potential consequences of the situation and the number of people who could be affected, and determine the appropriate locations for first responders.
Discuss the differences between dealing with this challenge in 2014 and dealing with it now.
First, we run a line of code to clear the environment and remove any existing R objects.
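A minimal sketch of such a reset, using the standard idiom:

```r
# Remove all objects from the global environment so the
# analysis starts from a clean workspace.
rm(list = ls())
```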
This code chunk checks whether the required packages are installed. Any missing packages are installed, and each one is then loaded into the current working environment.
packages <- c('readr','tidytext','data.table','lubridate','ggplot2',
              'caret','dplyr','tidyr','scales','quanteda','textdata',
              'stringr','stringi','reshape2','RColorBrewer','wordcloud',
              'forcats','igraph','ggraph','widyr','clock','knitr','tidyverse',
              'DT','hms','ggiraph','topicmodels','raster','sf','maptools',
              'rgdal','ggmap','sp','tmap','tmaptools','devtools','patchwork')
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
First, use read_csv() to import the three CSV files.
read1 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1700-1830.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
read2 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-1831-2000.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
read3 <- read_csv("F:/visual/assignment and project/MC3/MC3/csv-2001-2131.csv",
col_types = list(col_character(),col_character(),col_character(),
col_character(),col_double(),col_double(),
col_character()))
Use rbind() to combine the three data frames, which share the same format.
df <- rbind.data.frame(read1,read2,read3)
glimpse(df)
Rows: 4,063
Columns: 7
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <chr> "20140123170000", "20140123170000", "~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
The table above shows that date(yyyyMMddHHmmss) is not in a time format, so we convert it to a date-time field. Because all activity in Mini-Challenge 3 occurs on the same day, we also extract a time (hms) column without the date.
df$`date(yyyyMMddHHmmss)` <- date_time_parse(df$`date(yyyyMMddHHmmss)`,
zone = "",
format = "%Y%m%d %H%M%S")
df$time <- as_hms(ymd_hms((df$`date(yyyyMMddHHmmss)`)))
glimpse(df)
Rows: 4,063
Columns: 8
$ type <chr> "mbdata", "mbdata", "mbdata", "mbdata~
$ `date(yyyyMMddHHmmss)` <dttm> 2014-01-23 17:00:00, 2014-01-23 17:0~
$ author <chr> "POK", "maha_Homeland", "Viktor-E", "~
$ message <chr> "Follow us @POK-Kronos", "Don't miss ~
$ latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "~
$ time <time> 17:00:00, 17:00:00, 17:00:00, 17:00:~
For question 1, we extract the required columns from df into a new data frame. After reading all the data carefully, the vocabulary of mbdata and ccdata is very different, so we separate the two types. Since ccdata consists of police and fire department records, that subset is labeled as meaningful.
df1 <- subset(df, select = c("type","author","message"))
df_m <- subset(df1, type == "mbdata")
df_cc <- subset(df1,type == "ccdata")
df_cc$condition <- "meaningful"
DT::datatable(df_cc,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px',targets = c(3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Junk definition: after reading all the data carefully, I selected the following types of messages.
Authors like "KronosQuoth" post content related to the three events in this challenge (the rally, the fire and building collapse, and the accident that escalated into a shooting), but rarely provide important information such as location; the content is mostly emotional venting.
Content that includes the term "#Grammar" is mostly unrelated to the important events in Mini-Challenge 3.
Content containing "RT" is a retweet of other people's messages, repeating the same information, and the number of "RT" messages is very large.
# Build the author pattern from a vector so the regex contains no
# stray newlines or spaces from line wrapping.
junk_authors <- c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
                  "blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
                  "junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
                  "panopticon","dels4realz","eazymoney","cleaningFish")
junk <- df_m %>%
  filter(str_detect(author, paste(junk_authors, collapse = "|")) |
         str_detect(message, "#Grammar|RT"))
Meaningful definition: after reading all the data carefully, I selected the following types of messages.
Information released by official or authoritative authors, such as "POK" and "AbilaPoliceDepartment".
Information released by authors who are witnesses to the events and publish important details, such as "magaMan" and "Sara_Nespola".
Content that includes terms like "fire" and "rally", which is mostly related to the important events in Mini-Challenge 3.
# As with the junk filter, build the author pattern from a vector so the
# regex contains no stray whitespace from line wrapping.
meaningful_authors <- c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
                        "MindOfKronos","Viktor-E","maha_Homeland","anaregents",
                        "wordWatcher","InternationalNews","HomelandIlluminations",
                        "NewsOnlineToday","AbilaPoliceDepartment","KronosStar",
                        "magaMan","Sara_Nespola","protoGuy","SiaradSea","AbilaFire",
                        "footfingers","truthforcadau","truccotrucco","dangermice",
                        "trapanitweets","sofitees","brewvebeenserved","hennyhenhendrix")
meaningful <- df_m %>%
  filter(str_detect(author, paste(meaningful_authors, collapse = "|")) |
         str_detect(message,[2191 chars quoted with '"']))
Meaningless definition: after reading all the data carefully, I selected the following types of messages.
Content that includes terms like "fire" and "rally" and is related to the events, but only expresses emotions and provides no important information.
Content that is unrelated to the events but has its own objective, such as promoting shops.
This group is obtained by subtracting the other groups from df_m with the anti_join() function.
meaningful <- dplyr::anti_join(meaningful,junk,by = c("type", "author", "message"))
combinedata <- rbind.data.frame(meaningful, junk)
meaningless <- dplyr::anti_join(df_m,combinedata,by = c("type", "author", "message"))
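As a quick sanity check (my addition, not part of the original workflow), the three groups should partition df_m with no message assigned to more than one group:

```r
# junk and meaningful are disjoint (meaningful was anti_join-ed against junk),
# and meaningless is df_m minus their union, so the counts should add up.
stopifnot(nrow(meaningful) + nrow(junk) + nrow(meaningless) == nrow(df_m))
```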
Combine the meaningful, meaningless, and junk data and add a new label column.
junk$condition <- "junk"
meaningful$condition <- "meaningful"
meaningless$condition <- "meaningless"
finalq1 <- rbind.data.frame(meaningful,junk, meaningless)
DT::datatable(finalq1,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the stringr package to remove punctuation, @, #, < and Chinese characters from the messages. The messages in ccdata are already clean and need no special handling.
finalq1$message <- str_replace_all(finalq1$message,'[[:punct:]]+', "")
finalq1$message <- str_replace_all(finalq1$message,fixed("@"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("#"),"")
finalq1$message <- str_replace_all(finalq1$message,fixed("<"),"")
finalq1$message <- str_replace_all(finalq1$message,"[\u4e00-\u9fa5]+", "")
The messages in the table below are clean.
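The cleaned table was presumably rendered with the same DT::datatable() call used for the earlier tables; a sketch:

```r
# Interactive table of the cleaned messages, matching the DT options above.
DT::datatable(finalq1, filter = 'top',
              extensions = 'Buttons',
              options = list(autoWidth = FALSE,
                             columnDefs = list(list(width = '400px', targets = c(3))),
                             dom = 'Bfrtip',
                             buttons = c('copy', 'csv', 'excel', 'print', 'pdf')))
```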
Exclude stop words from the text, and use tibble() to define custom stop words chosen from the content of the text.
tidy_m <- finalq1 %>%
unnest_tokens(word, message) %>%
count(condition,word,sort = TRUE)
data(stop_words)
tidy_m <- tidy_m %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
tidy_m <- tidy_m %>%
anti_join(my_stopwords)
tidy_cc <- df_cc %>%
unnest_tokens(word,message) %>%
count(word, sort = TRUE)
DT::datatable(tidy_m,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
The graph below shows the word-count distribution for the junk, meaningful, and meaningless groups. Junk messages have a high word repetition rate.
ggplot(tidy_m,aes(n,fill = condition))+
geom_histogram(show.legend = FALSE)+
scale_fill_brewer(palette = "Pastel1")+
xlim(0,100)+
facet_wrap(~condition, ncol = 2,scales = "free_y")

The graph below shows the top 15 words (by count n) for the junk, meaningful, and meaningless groups.
Junk group
Most are words related to the POK rally, used to vent feelings, or words related to success, such as "success" and "life". There are also statements against the government and police.
Obvious spam features such as rt and grammar also appear; these messages are very numerous and provide no useful information.
Meaningful group
Words that identify the important people in the events, such as "viktore".
Words that describe the key developments of the events, such as "standoff".
Words that identify the important locations of the events, such as "dolphin" and "dancing".
Meaningless group
Content sent by individuals for their own purposes.
Content sent by businesses to promote their company or products; a word like "credit" appears in credit-card promotions.
tidy_m %>%
group_by(condition) %>%
slice_max(n, n= 15) %>%
ungroup() %>%
mutate(word = reorder_within(word,n,condition)) %>%
ggplot(aes(x = n,
y= word,
fill =condition))+
geom_col(show.legend = FALSE)+
scale_fill_brewer(palette = "Pastel1")+
scale_y_reordered()+
facet_wrap(~ condition, ncol = 2,scales = "free")+
ggtitle("mbdata") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)

The ccdata graph highlights events such as "fire" and "traffic". A word like "vehicle" relates to the event "From hit-and-run accident to shooting and standoff".
tidy_cc %>%
slice_max(n,n = 15) %>%
ggplot(aes(x = n,
y= reorder(word,n)))+
geom_col(show.legend = FALSE, fill = "darkgoldenrod1")+
ggtitle("ccdata_meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.45))+
labs(y = NULL)

The conclusions from the word clouds are similar to those of the simple EDA above.
wordcloud_m <- finalq1 %>%
filter(condition == "meaningful") %>%
unnest_tokens(word, message)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
count(word,sort = TRUE)%>%
with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
filter(condition == "meaningless") %>%
unnest_tokens(word, message)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
count(word,sort = TRUE)%>%
with(wordcloud(word,n,max.words = 100))
wordcloud_m <- finalq1 %>%
filter(condition == "junk") %>%
unnest_tokens(word, message)%>%
count(word,sort = TRUE)%>%
anti_join(stop_words) %>%
anti_join(my_stopwords) %>%
with(wordcloud(word,n,max.words = 100))
tidy_cc %>%
with(wordcloud(word,n,max.words = 100))

Use bind_tf_idf() to find the important words for each category.
Junk: the graph shows that for junk messages, retweets (like "rt"), social media accounts (like "kronosstar"), and emotive words (like "success") are the important components.
Meaningful: the graph shows that for meaningful messages, government departments or emergency services (like "apd" and "afd") and important people at the events (like "sylvia") are the important components.
Meaningless: the graph shows that for meaningless messages, company event slogans (like "nobanks") are the important components.
m_tf_idf <- tidy_m %>%
bind_tf_idf(word,condition,n)
m_tf_idf %>%
arrange(desc(tf_idf))
# A tibble: 4,111 x 6
condition word n tf idf tf_idf
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 junk rt 1000 0.0558 1.10 0.0613
2 junk kronosstar 884 0.0493 0.405 0.0200
3 junk homelandilluminations 183 0.0102 1.10 0.0112
4 junk grammar 157 0.00876 1.10 0.00962
5 junk abilapost 330 0.0184 0.405 0.00746
6 junk rally 260 0.0145 0.405 0.00588
7 meaningless cards 9 0.00521 1.10 0.00572
8 meaningless easycreditkronosmorecredit 9 0.00521 1.10 0.00572
9 meaningless nobanks 9 0.00521 1.10 0.00572
10 meaningful abilapost 70 0.0137 0.405 0.00555
# ... with 4,101 more rows
m_tf_idf %>%
group_by(condition) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
mutate(word = reorder_within(word,tf_idf,condition)) %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = condition))+
scale_fill_brewer(palette = "Pastel1")+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~condition,ncol = 2,scales = "free")+
ggtitle("mbdata") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)

Use bigrams to find the important phrases for each category.
In fact, there is little difference between the three categories, but meaningful messages contain more specific time and place phrases.
meaningful_bigrams <- meaningful %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningful_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 5,775 x 2
bigram n
<chr> <int>
1 viktor e 48
2 of the 42
3 dancing dolphin 41
4 in the 40
5 at the 38
6 abila centralbulletin 30
7 pok rally 28
8 to the 24
9 dr newman 23
10 dolphin fire 20
# ... with 5,765 more rows
meaningful_separated <- meaningful_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
meaningful_filtered <- meaningful_separated %>%
  # my_stopwords is a tibble, so compare against its word column
  filter(!word1 %in% my_stopwords$word) %>%
  filter(!word2 %in% my_stopwords$word)
meaningful_filtered <- meaningful_filtered %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
meaningful_counts <- meaningful_filtered %>%
count(word1, word2, sort = TRUE)
meaningful_graph <- meaningful_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningful_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
junk_bigrams <- junk %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
junk_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 7,338 x 2
bigram n
<chr> <int>
1 pokrally hi 670
2 kronosstar pokrally 598
3 pok rally 234
4 rt homelandilluminations 183
5 rt abilapost 169
6 rally grammar 157
7 rt kronosstar 143
8 if you 126
9 of the 115
10 you can 102
# ... with 7,328 more rows
junk_separated <- junk_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
junk_filtered <- junk_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
junk_counts <- junk_filtered %>%
count(word1, word2, sort = TRUE)
junk_graph <- junk_counts %>%
filter(n > 50) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(junk_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
meaningless_bigrams <- meaningless %>%
unnest_tokens(bigram,message,token = "ngrams", n = 2)
meaningless_bigrams %>%
count(bigram, sort = TRUE)
# A tibble: 2,051 x 2
bigram n
<chr> <int>
1 badprofiles.kronos tacky 12
2 of the 12
3 viktor e 10
4 abila nobanks 9
5 abila pictures 9
6 cards get 9
7 credit cards 9
8 easy credit 9
9 easycredit.kronos morecredit 9
10 get what 9
# ... with 2,041 more rows
meaningless_separated <- meaningless_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
meaningless_filtered <- meaningless_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
meaningless_counts <- meaningless_filtered %>%
count(word1, word2, sort = TRUE)
meaningless_graph <- meaningless_counts %>%
filter(n > 4) %>%
graph_from_data_frame()
set.seed(2020)
a <- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(meaningless_graph,layout = "fr")+
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()

We can distinguish the categories by the characteristics of the words used in each.
Junk: the characteristics of the words in the junk category:
Most are words related to the POK rally, used to vent feelings, or words related to success, such as "success" and "life". There are also statements against the government and police.
Obvious spam features such as rt and grammar also appear; these messages are very numerous and provide no useful information.
There are many social media account names, such as "kronosstar".
Meaningful: the characteristics of the words in the meaningful category:
Words that identify the important people in the events, such as "viktore".
Words that describe the key developments of the events and the government or emergency services involved, such as "standoff", "afd", and "apd".
Words that identify the important locations of the events, such as "dolphin" and "dancing".
Meaningless: the characteristics of the words in the meaningless category:
Content sent by individuals for their own purposes.
Content sent by businesses to promote their company or products; a word like "credit" appears in credit-card promotions.
Repeat the process above to obtain a complete meaningful dataset.
q2_m <- subset(df, type == "mbdata")
q2_cc <- subset(df,type == "ccdata")
# As before, build the author pattern from a vector so the regex contains
# no stray whitespace from line wrapping.
q2_junk_authors <- c("KronosQuoth","Clevvah4Eva","choconibbs","trollingsnark",
                     "blueSunshine","whiteprotein","FriendsOfKronos","junkman377",
                     "junkman995","redisrad","roger_roger","cheapgoods998","rockinHW",
                     "panopticon","dels4realz","eazymoney","cleaningFish")
q2_junk <- q2_m %>%
  filter(str_detect(author, paste(q2_junk_authors, collapse = "|")) |
         str_detect(message, "#Grammar|RT"))
q2_meaningful_authors <- c("POK","AbilaPost","CentralBulletin","ourcountryyourrights",
                           "MindOfKronos","Viktor-E","maha_Homeland","anaregents",
                           "wordWatcher","InternationalNews","HomelandIlluminations",
                           "NewsOnlineToday","AbilaPoliceDepartment","KronosStar",
                           "magaMan","Sara_Nespola","protoGuy","SiaradSea","AbilaFire",
                           "footfingers","truthforcadau","truccotrucco","dangermice",
                           "trapanitweets","sofitees","brewvebeenserved","hennyhenhendrix")
q2_meaningful <- q2_m %>%
  filter(str_detect(author, paste(q2_meaningful_authors, collapse = "|")) |
         str_detect(message,[2191 chars quoted with '"']))
q2_meaningful <- dplyr::anti_join(q2_meaningful,q2_junk)
q2_lda <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q2_lda <- na.omit(q2_lda)
tidy_q2 <- q2_lda %>%
unnest_tokens(word,message)
q2_wordcount <- tidy_q2 %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_wordcount <- q2_wordcount %>%
anti_join(my_stopwords) %>%
count(author,word,sort = TRUE)
q2_wordcount
# A tibble: 453 x 3
author word n
<chr> <chr> <int>
1 footfingers pok 21
2 truccotrucco standoff 21
3 truccotrucco im 12
4 truccotrucco shooting 12
5 truccotrucco gelatogalore 9
6 truccotrucco van 9
7 truthforcadau viktor 8
8 footfingers kronos 7
9 dangermice abilafire 6
10 footfingers people 6
# ... with 443 more rows
q2_dtm <- q2_wordcount %>%
cast_dfm(author,word,n)
q2_author_lda <- LDA(q2_dtm,k = 3, control = list(seed = 1234))
q2_topics <- tidy(q2_author_lda,matrix = "beta")
q2_topics
# A tibble: 1,092 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.37e- 1
2 2 pok 2.64e-82
3 3 pok 1.84e- 2
4 1 standoff 1.25e-81
5 2 standoff 2.41e- 4
6 3 standoff 7.63e- 2
7 1 im 3.95e-82
8 2 im 5.06e- 5
9 3 im 4.28e- 2
10 1 shooting 1.24e-87
# ... with 1,082 more rows
q2_topics %>%
group_by(topic) %>%
top_n(10,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
q2_author_lda2 <- LDA(q2_dtm,k = 4, control = list(seed = 1234))
q2_topics2 <- tidy(q2_author_lda2,matrix = "beta")
q2_topics2
# A tibble: 1,456 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.37e- 1
2 2 pok 6.08e-121
3 3 pok 1.92e- 2
4 4 pok 3.27e-115
5 1 standoff 6.36e-121
6 2 standoff 2.68e- 2
7 3 standoff 6.71e- 2
8 4 standoff 2.70e-104
9 1 im 3.87e-121
10 2 im 1.34e- 2
# ... with 1,446 more rows
q2_topics2 %>%
group_by(topic) %>%
top_n(10,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
q2_author_lda3 <- LDA(q2_dtm,k = 5, control = list(seed = 1234))
q2_topics3 <- tidy(q2_author_lda3,matrix = "beta")
q2_topics3
# A tibble: 1,820 x 3
topic term beta
<int> <chr> <dbl>
1 1 pok 1.75e- 1
2 2 pok 1.00e-159
3 3 pok 1.92e- 2
4 4 pok 3.64e-154
5 5 pok 7.14e- 2
6 1 standoff 1.46e-157
7 2 standoff 2.68e- 2
8 3 standoff 6.71e- 2
9 4 standoff 1.03e-129
10 5 standoff 4.84e-154
# ... with 1,810 more rows
q2_topics3 %>%
group_by(topic) %>%
top_n(5,beta) %>%
ungroup() %>%
mutate(term = reorder_within(term,beta,topic)) %>%
ggplot(aes(beta,term,fill = topic))+
scale_y_reordered()+
geom_col(show.legend = FALSE)+
facet_wrap(~topic,ncol = 2,scales = "free")+
ggtitle("meaningful") +
theme(plot.title = element_text(size=10,
hjust = 0.4))+
labs(y = NULL)
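One optional way (my addition, not part of the original workflow) to compare the k = 3, 4, and 5 models beyond eyeballing the topic plots is perplexity, where lower values suggest a better statistical fit, though interpretability should still drive the final choice of k:

```r
# Perplexity of each fitted VEM model on its training data;
# assumes the q2_author_lda* objects fitted above.
sapply(list(k3 = q2_author_lda, k4 = q2_author_lda2, k5 = q2_author_lda3),
       topicmodels::perplexity)
```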

Use the important words of the POK rally event to select the rally-related messages from the meaningful dataset.
q2_rally_m <- q2_meaningful %>%
filter(str_detect(message,"pokrally|Abila City Park|Stand Up Speak Up|Sylvia Marek|Audrey McConnell Newman|Professor Lorenzo Di Stefano|Lucio Jakab|Viktor-E|Sylvia|Marek|Newman|Stefano|Di Stefano|Lucio|Jakab"))
q2_rally_cc <- q2_cc %>%
filter(str_detect(message,"ABILA CITY PARK|CROWD"))
q2_rally <- rbind(q2_rally_m,q2_rally_cc)
DT::datatable(q2_rally,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_rally$message <- str_replace_all(q2_rally$message,'[[:punct:]]+', "")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("@"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("#"),"")
q2_rally$message <- str_replace_all(q2_rally$message,fixed("<"),"")
q2_rally$message <- str_replace_all(q2_rally$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_rally_tidy <- q2_rally %>%
unnest_tokens(word, message)
data(stop_words)
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_rally_tidy <- q2_rally_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_rally_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the important words of the "Fire in Dancing Dolphin" event to select the fire-related messages from the meaningful dataset.
q2_fire_m <- q2_meaningful %>%
filter(str_detect(message,"fire|dolphin|dancing|building|apartment|Madeg|dispatch|afd|floor|floors|fireman|firefighters|firefighter|evacuate|evacuated|evacuating|evacuation|trapped|injuries|scene|trapped|collapsed|blaze|escalated"))
q2_fire_cc <- q2_cc %>%
filter(str_detect(message,"Fire|Crime|scene"))
q2_fire <- rbind(q2_fire_m,q2_fire_cc)
DT::datatable(q2_fire,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_fire$message <- str_replace_all(q2_fire$message,'[[:punct:]]+', "")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("@"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("#"),"")
q2_fire$message <- str_replace_all(q2_fire$message,fixed("<"),"")
q2_fire$message <- str_replace_all(q2_fire$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_fire_tidy <- q2_fire %>%
unnest_tokens(word, message)
data(stop_words)
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_fire_tidy <- q2_fire_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_fire_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Use the important words of the "From hit-and-run accident to shooting and standoff" event to select the accident-related messages from the meaningful dataset.
q2_accident_m <- q2_meaningful %>%
filter(str_detect(message,"shooting|standoff|hostage|swat|negotiation|fight|arrest|hit|van|driver|bicyclist|accident|incident|bike|L829|pursuit|gun|shot|kill|dead|yelling|screaming|negotiating|negotiator|caught|over|end|shoot|chasing"))
q2_accident_cc <- q2_cc %>%
filter(str_detect(message,"van|pursuit|accident|vandalism|swat"))
q2_accident <- rbind(q2_accident_m,q2_accident_cc)
DT::datatable(q2_accident,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '400px', targets = c(4))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
Clean the message content.
q2_accident$message <- str_replace_all(q2_accident$message,'[[:punct:]]+', "")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("@"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("#"),"")
q2_accident$message <- str_replace_all(q2_accident$message,fixed("<"),"")
q2_accident$message <- str_replace_all(q2_accident$message,"[\u4e00-\u9fa5]+", "")
Tokenize the dataset.
q2_accident_tidy <- q2_accident %>%
unnest_tokens(word, message)
data(stop_words)
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(stop_words)
my_stopwords <- tibble(word = c("zawahiri","yikes","yehu","yeah",
"yay","ya","xx3942","wuz","wow",
"dr"))
q2_accident_tidy <- q2_accident_tidy %>%
anti_join(my_stopwords)
DT::datatable(q2_accident_tidy,filter = 'top',
extensions = 'Buttons',
options = list(autoWidth = FALSE, columnDefs = list(list(width = '60px', targets = c(0:3))),
dom='Bfrtip',
buttons=c('copy', 'csv', 'excel', 'print', 'pdf')))
From processing the data, three events were identified in Mini-Challenge 3.
The POK rally event, which centers on the start time of the rally, the times of the different speakers, and the end time.
Fire in Dancing Dolphin, which centers on the start of the fire, the rescue, the evacuation, and the final explosion.
From hit-and-run accident to shooting and standoff, which centers on the traffic accident, the hostage situation, the confrontation with the police, the negotiations, and the rescue of the hostages.
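To compare how attention shifts between the three events over the evening, one optional overview (my addition; assumes the q2_rally, q2_fire, and q2_accident data frames built above) is to count messages per event in five-minute bins:

```r
event_counts <- dplyr::bind_rows(
  dplyr::mutate(q2_rally,    event = "rally"),
  dplyr::mutate(q2_fire,     event = "fire"),
  dplyr::mutate(q2_accident, event = "accident")) %>%
  # Bin each message into the five-minute interval of its timestamp.
  dplyr::count(event,
               bin = lubridate::floor_date(`date(yyyyMMddHHmmss)`, "5 minutes"))

ggplot(event_counts, aes(x = bin, y = n, colour = event)) +
  geom_line() +
  labs(x = NULL, y = "messages per 5 minutes")
```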
The POK rally was held in Abila City Park. About 2,000 people gathered there, and a heavy police presence was deployed.
q2_rally_tidy%>%
filter(str_detect(word,"rally")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CCFF")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="rally")
q2_rally_tidy%>%
filter(str_detect(word,"pok")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CCFF")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="pok")

The most important figure in the speeches is the POK leader Sylvia Marek, who acted much like a host. At the beginning, Sylvia Marek opened with a speech and introduced the guests: Dr. Audrey McConnell Newman, Professor Lorenzo Di Stefano, and Lucio Jakab. The diagram shows intermittent references to Sylvia Marek throughout the speeches.
q2_rally_tidy%>%
filter(str_detect(word,"sylvia|marek")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="sylvia|marek")

+ The first speaker was Lucio Jakab, co-founder of Save Our Wildlands. He said that we have a symbiotic relationship with the earth and that the wildlands are our legacy for the children.
+ After Jakab's speech, Viktor-E took the stage to play the new song "River Soldiers".
+ The third speaker, Professor Lorenzo Di Stefano, gave a talk about corporate social responsibility.
+ The next talk was delivered by Dr. McConnell Newman, an internationally renowned environmental scientist.
+ The speeches ended at 18:45. Sylvia Marek thanked Dr. Newman and welcomed Viktor-E back on stage to sing their hit "Stand Up Speak Up". The rally closed around 19:05.
q2_rally_tidy%>%
filter(str_detect(word,"lucio|jakab")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = "lucio|jakab")
q2_rally_tidy%>%
filter(str_detect(word,"viktor")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="viktor")
q2_rally_tidy%>%
filter(str_detect(word,"lorenzo|di|stefano")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="lorenzo|di|stefano")
q2_rally_tidy%>%
filter(str_detect(word,"audrey|mcconnell|newman")) %>% # tokens are lowercased by unnest_tokens
ggplot(aes(x = time)) +
geom_histogram(fill = "#6699CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y ="audrey|mcConnell|newman")

q2_fire_tidy%>%
filter(str_detect(word,"fire|dolphin|dancing|building|apartment")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = NULL)
q2_fire_tidy%>%
filter(str_detect(word,"floor|floors|upper|resident|trapped")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"afd|police|cop|cops|fireman|firefighter")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FFCCCC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"ambulance|injury|injuries|evacuated|evacuating|evacuation|evacuate")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#FF99CC")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_fire_tidy%>%
filter(str_detect(word,"control")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#CC6699")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = "control")
q2_fire_tidy%>%
filter(str_detect(word,"collapsed|blaze|escalated|explosion")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#993366")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)

# Hit-and-run / standoff keyword frequencies over time
q2_accident_tidy%>%
filter(str_detect(word,"hit|run|van|bicyclist|driver|incident|accident|bike|pursuit")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#99CC99")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y = NULL)
q2_accident_tidy%>%
filter(str_detect(word,"gun|shoot|shot|hostage")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#00CC66")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"killed|dead")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#009900")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"standoff|negotiating|negotiate|negotiator|negotiation|yelling|screaming|chasing")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#006600")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
q2_accident_tidy%>%
filter(str_detect(word,"end|over|caught|rescued")) %>%
ggplot(aes(x = time)) +
geom_histogram(fill = "#003300")+
coord_cartesian(xlim = c(61200,77460))+
theme(panel.grid=element_blank(),axis.text.x= element_text(angle=60, hjust= 1))+
labs(y =NULL)
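The filter-and-plot pattern above is repeated for every keyword group. It could be collapsed into a small helper function (a sketch, assuming each `q2_*_tidy` data frame carries the `word` and `time` columns used above; `plot_keyword_freq` is a hypothetical name):

```r
# Sketch of a helper reproducing the repeated filter-and-plot blocks above.
# Assumes a tidy data frame with `word` and `time` columns, e.g. q2_fire_tidy.
plot_keyword_freq <- function(tidy_df, pattern, fill = "#6699CC", label = NULL) {
  tidy_df %>%
    filter(str_detect(word, pattern)) %>%
    ggplot(aes(x = time)) +
    geom_histogram(fill = fill) +
    coord_cartesian(xlim = c(61200, 77460)) +
    theme(panel.grid = element_blank(),
          axis.text.x = element_text(angle = 60, hjust = 1)) +
    labs(y = label)
}

# Example usage:
# plot_keyword_freq(q2_fire_tidy, "fire|dolphin|dancing", fill = "#FFCCCC")
```

This keeps the colour, axis limits, and keyword pattern as arguments, so each of the plots above becomes a one-line call.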

# Import the Abila tourist map as the raster background layer
bgmap <- raster("F:/visual/assignment and project/MC3/MC3/Geospatial/MC2-tourist.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
# Read the Abila street network shapefile
abila_st <- st_read(dsn = "F:/visual/assignment and project/MC3/MC3/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source
`F:\visual\assignment and project\MC3\MC3\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
abila <- read_sf("F:/visual/assignment and project/MC3/MC3/Geospatial/Abila.shp")
# Keep only the columns needed for mapping and drop records without coordinates
q3_gps <- subset(q2_meaningful,select = c("type","date(yyyyMMddHHmmss)","author",
                                          "message","latitude","longitude","time"))
gps_m <- na.omit(q3_gps)
# Count messages per coordinate, then convert to sf point features (WGS 84)
p <- gps_m %>%
  count(longitude,latitude)
p$n <- as.numeric(p$n)
gps_sf <- st_as_sf(p,
                   coords = c("longitude","latitude"),
                   crs = 4326)
gps_point <- gps_sf %>%
  st_cast("MULTIPOINT")
q3_rally_gps <- subset(q2_rally_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q3_rally_gps <- na.omit(q3_rally_gps)
rally_count <- q3_rally_gps %>%
count(longitude,latitude)
rally_count$n <- as.numeric(rally_count$n)
gps_rally_sf <- st_as_sf(rally_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_rally_point <- gps_rally_sf %>%
st_cast("MULTIPOINT")
q3_fire_gps <- subset(q2_fire_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q3_fire_gps <- na.omit(q3_fire_gps)
fire_count <- q3_fire_gps %>%
count(longitude,latitude)
fire_count$n <- as.numeric(fire_count$n)
gps_fire_sf <- st_as_sf(fire_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_fire_point <- gps_fire_sf %>%
st_cast("MULTIPOINT")
q3_accident_gps <- subset(q2_accident_m,select = c("type","date(yyyyMMddHHmmss)","author",
"message","latitude","longitude","time"))
q3_accident_gps <- na.omit(q3_accident_gps)
accident_count <- q3_accident_gps %>%
count(longitude,latitude)
accident_count$n <- as.numeric(accident_count$n)
accident_count <- na.omit(accident_count)
gps_accident_sf <- st_as_sf(accident_count,
coords = c("longitude","latitude"),
crs = 4326)
gps_accident_point <- gps_accident_sf %>%
st_cast("MULTIPOINT")
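Each message category above goes through the same count, `st_as_sf`, and `st_cast` steps. A small helper could remove the repetition (a sketch, assuming each `q3_*_gps` data frame has `longitude` and `latitude` columns as above; `to_gps_points` is a hypothetical name):

```r
# Sketch: collapse the repeated count/convert steps into one function.
# Assumes a data frame with `longitude` and `latitude` columns.
to_gps_points <- function(gps_df) {
  gps_df %>%
    na.omit() %>%
    count(longitude, latitude) %>%      # messages per coordinate
    st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
    st_cast("MULTIPOINT")
}

# Example usage:
# gps_fire_point <- to_gps_points(q3_fire_gps)
```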
# Interactive map of all meaningful messages; dot size encodes message count
tmap_mode("view")
tm_shape(bgmap,point.per = "feature")+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_point,is.master = TRUE, point.per = 'feature')+
tm_dots(size ="n")
# Interactive map of rally-related messages
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_rally_point) +
tm_dots(size = "n")
# Interactive map of fire-related messages
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_fire_point) +
tm_dots(size = "n")
# Interactive map of hit-and-run / standoff messages
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(r=1,g=2,b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_accident_point) +
tm_dots(size = "n")